```{python}
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN
from sklearn.model_selection import train_test_split, ParameterGrid
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
```
Spotify Recommendation System With Clustering
Author: Daniel Hassler
Data Analysis
Clustering algorithms apply to many real-world problems, including security, anomaly detection, document clustering, stock market analysis, and image compression. The application I chose for clustering is a song recommendation system. I found a dataset on Kaggle containing almost 114,000 songs from the popular music streaming platform Spotify. Each entry in the dataset has many features, including artists, track_name, track_genre, popularity, and danceability.
Below are scatterplots of a few feature pairs, colored by genre. They give me a rough idea of what the dataset looks like across these features and genres.
```{python}
original_df = pd.read_csv("./dataset.csv")
print(original_df.columns)
features_x = ["loudness", "popularity", "duration_ms"]
features_y = ["popularity", "energy", "tempo"]
for x, y in zip(features_x, features_y):
    # One scatterplot per feature pair, colored by genre
    scatter = sns.scatterplot(x=x, y=y, hue='track_genre', data=original_df, palette="viridis", alpha=0.25)
    legend_labels = original_df['track_genre'].unique()
    # Tiny font size so the full genre list fits in the legend
    scatter.legend(title='Genre', labels=legend_labels, prop={'size': 1})
    plt.title(f"Scatter Plot of {x} vs {y} by genre")
    plt.show()
```
Index(['Unnamed: 0', 'track_id', 'artists', 'album_name', 'track_name',
'popularity', 'duration_ms', 'explicit', 'danceability', 'energy',
'key', 'loudness', 'mode', 'speechiness', 'acousticness',
'instrumentalness', 'liveness', 'valence', 'tempo', 'time_signature',
'track_genre'],
dtype='object')
K-Means
The K-Means algorithm clusters data by minimizing a criterion known as inertia, the within-cluster sum of squares. The formula for inertia, as given in the scikit-learn K-Means documentation, is:
\[ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2) \]
The variables in the summation are:
* n is the number of data points
* \mu_j is the mean (centroid) of cluster j, taken from the set of centroids C
* ||x_i - \mu_j||^2 is the squared Euclidean distance between point x_i and centroid \mu_j
* min() selects, for each point, the smallest of these distances, i.e. the distance to its nearest centroid
A great benefit to K-means is its scalability to large sample sets.
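To make the inertia formula concrete, here is a minimal sketch that recomputes it by hand and compares the result with the `inertia_` attribute of a fitted `KMeans` model. The data is a synthetic stand-in (three Gaussian blobs), not the Spotify dataset, and the variable names are illustrative only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (not the Spotify dataset): three Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in ([0, 0], [5, 5], [0, 5])
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Squared Euclidean distance from every point to every centroid, shape (150, 3).
diffs = X[:, None, :] - kmeans.cluster_centers_[None, :, :]
sq_dists = (diffs ** 2).sum(axis=2)

# Inertia: for each point keep the distance to its closest centroid, then sum.
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, kmeans.inertia_)
```

The two printed values should agree up to floating-point error, since the `min` over centroids in the formula is exactly the assignment of each point to its nearest cluster.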
Hyperparameter Tuning
```{python}
inertia = []
# train_df is the numeric representation of original_df
train_df = original_df.drop(columns=["track_id"])
for col in train_df.columns:
    # Encode non-numeric columns (artists, genre, ...) as integer codes
    if not pd.api.types.is_numeric_dtype(train_df[col]):
        train_df[col] = pd.factorize(original_df[col])[0]
scaler = StandardScaler()
# df_scaled is the scaled version of train_df
df_scaled = scaler.fit_transform(train_df)
# Fit K-Means for k = 1, 11, 21, ..., 191 and record the inertia of each fit
for k in range(1, 200, 10):
    # n_init=10 pins the current default and silences the FutureWarning
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(df_scaled)
    inertia.append(kmeans.inertia_)
```
Plotting the Elbow Chart
```{python}
plt.plot(range(1, 200, 10), inertia, marker='o')
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.show()
```
K-Means for Spotify
```{python}
# n_init=10 pins the current default and silences the FutureWarning
kmeans = KMeans(n_clusters=23, n_init=10, random_state=42)
original_df['clusters'] = kmeans.fit_predict(df_scaled)
```
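The elbow chart gives a visual cue for choosing K. A silhouette score is one complementary check that could be run alongside it. The sketch below is an illustration on synthetic blobs standing in for `df_scaled`, not a result from the Spotify data; the `sample_size` argument scores a random subset, which may help on a dataset of this size because the silhouette computation needs pairwise distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_scaled: four blobs, standardized like the real data.
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=0)
X = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Score a random subset of 300 points to keep the pairwise cost down.
    scores[k] = silhouette_score(X, labels, sample_size=300, random_state=42)

# Candidate K with the highest silhouette score (closer to 1 is better).
best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette:", best_k)
```

On the real data, the same loop could be run over a handful of K values near the elbow rather than the full range.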
Evaluating K-Means for Spotify
```{python}
# Distance from each song to its own (nearest) cluster centroid
original_df['Distance_to_Centroid'] = kmeans.transform(df_scaled).min(axis=1)

def get_nearest_entry(idx, k=5):
    # Cluster of the query song; reshape(1, -1) makes a single-row 2D input
    cluster = kmeans.predict(df_scaled[idx].reshape(1, -1))[0]
    # Copy the slice so adding a column does not raise SettingWithCopyWarning
    cluster_data = original_df[original_df["clusters"] == cluster].copy()
    # Rank cluster members by how close their centroid distance is to the query's
    cluster_data["closest_entries_to_idx"] = (cluster_data["Distance_to_Centroid"] - cluster_data.loc[idx, "Distance_to_Centroid"]).abs()
    cluster_data = cluster_data.sort_values(by="closest_entries_to_idx")
    print(f"Top {k} Closest Examples to {cluster_data.loc[idx, 'artists']}'s \"{cluster_data.loc[idx, 'track_name']}\"")
    print(cluster_data[:k][["artists", "album_name", "track_name", "track_genre"]])
    print("\n\n")

get_nearest_entry(44151) # rock song
get_nearest_entry(19015) # country song
get_nearest_entry(51136) # rap song
```
Top 5 Closest Examples to Daughtry's "Home"

| index | artists | album_name | track_name | track_genre |
|---|---|---|---|---|
| 44151 | Daughtry | Daughtry | Home | grunge |
| 22159 | Currents | The Death We Seek | The Death We Seek | death-metal |
| 6552 | Dark Funeral | Where Shadows Forever Reign | Unchain My Soul | black-metal |
| 3355 | Shinsei Kamattechan | Boku no Sensou | Boku no Sensou | alternative |
| 44252 | Linkin Park | A Thousand Suns | Robot Boy | grunge |

Top 5 Closest Examples to Florida Georgia Line's "Stay"

| index | artists | album_name | track_name | track_genre |
|---|---|---|---|---|
| 19015 | Florida Georgia Line | Sad Country Songs | Stay | country |
| 7189 | Leftover Salmon | High Country | High Country | bluegrass |
| 727 | Takehara Pistol | youth | 全て身に覚えのある痛みだろう? | acoustic |
| 10465 | Keith Mackenzie;DJ Fixx;Whiskey Pete | Blazing | Blazing | breakbeat |
| 31547 | Jax Jones;Years & Years | Queda poco para la PAES | Play | electro |

Top 5 Closest Examples to Future;Lil Uzi Vert's "Tic Tac"

| index | artists | album_name | track_name | track_genre |
|---|---|---|---|---|
| 51136 | Future;Lil Uzi Vert | Tic Tac - Just Rap | Tic Tac | hip-hop |
| 15533 | A1 x J1;Nemzzz | Don’t Lie (feat. Nemzzz) | Don’t Lie (feat. Nemzzz) | chill |
| 39817 | FFRAGEZEICHEN | FR EP 2 | Passed Out | german |
| 53209 | Tiësto;KAROL G | Don't Be Shy | Don't Be Shy | house |
| 24700 | Moodymann | Moodymann | No | detroit-techno |
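The lookup above ranks songs by how similar their distance to the centroid is to the query song's distance, so two songs on opposite sides of a centroid can rank as close. As a point of comparison, here is a minimal sketch of an alternative that ranks cluster members by their direct Euclidean distance to the query in the scaled feature space. It is not the method used in this post; `nearest_in_cluster` is a hypothetical helper, and the data is a random stand-in for `df_scaled`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for df_scaled (not the Spotify data): 200 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)
labels = kmeans.labels_

def nearest_in_cluster(idx, k=5):
    """Return row positions of the k rows closest to row idx,
    searching only within idx's own cluster (hypothetical helper)."""
    members = np.flatnonzero(labels == labels[idx])
    # Euclidean distance from the query row to every member of its cluster.
    dists = np.linalg.norm(X[members] - X[idx], axis=1)
    order = np.argsort(dists)
    return members[order][:k]

print(nearest_in_cluster(0))
```

As in the original output, the query row itself comes back first because its distance to itself is zero. With the real data, the returned positions could be passed to `original_df.iloc` to print artist and track names.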